BRS  L1  addressing WITH preindirecting  USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB
L7  DERWENT; IBM_TDB
L8  USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB
L9  7 and preindex$5  USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB
USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB  2004/09/14 14:43
BRS  L10  9  7 and pre-index$5  USPAT; US-PGPUB; EPO; JPO; DERWENT; IBM_TDB  2004/09/14 14:43

Alan Jay Smith

2 Data and memory optimization techniques for embedded systems

P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. 
Vandercappelle, P. G. Kjeldsberg 

April 2001 ACM Transactions on Design Automation of Electronic Systems (TODAES), 

Volume 6 Issue 2 


We present a survey of the state-of-the-art techniques used in performing data and 
memory-related optimizations in embedded systems. The optimizations are targeted directly 
or indirectly at the memory subsystem, and impact one or more out of three important cost 
metrics: area, performance, and power dissipation of the resulting implementation. We first 
examine architecture-independent optimizations in the form of code transformations. We 
next cover a broad spectrum of optimizati ... 

Keywords: DRAM, SRAM, address generation, allocation, architecture exploration, code 
transformation, data cache, data optimization, high-level synthesis, memory architecture 
customization, memory power dissipation, register file, size estimation, survey 



3 Compiling nested data-parallel programs for shared-memory multiprocessors 
Siddhartha Chatterjee 

July 1993 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 15 Issue 3 




Keywords: compilers, data parallelism, shared-memory multiprocessors 

November 1980 Proceedings of the 13th annual workshop on Microprogramming 


A new microprogrammable computer with low-level parallelism was built and has been 
utilized as a research vehicle for solving different classes of research-oriented applications 
such as real-time processings on static/dynamic images, pictures and signals, and 
emulations of both existing and virtual machines including high (intermediate) level 
language machines. The design goal of a research-oriented computer, QA-1, was to achieve 
a high degree of processing power and system flexi ... 

5 The Clipper processor: instruction set architecture and implementation 
W. Hollingsworth, H. Sachs, A. J. Smith 
February 1989 Communications of the ACM, volume 32 Issue 2 


Intergraph's CLIPPER microprocessor is a high performance, three chip module that 
implements a new instruction set architecture designed for convenient programmability, 
broad functionality, and easy future expansion. 

6 Tango: a hardware-based ... 
Shlomit S. Pinter, Adi Yoaz 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium on 
Microarchitecture 


We present a new hardware-based data prefetching mechanism for enhancing instruction 
level parallelism and improving the performance of superscalar processors. The emphasis in 
our scheme is on the effective utilization of slack time and hardware resources not used for 
the main computation. The scheme suggests a new hardware construct, the program 
progress graph (PPG), as a simple extension to the branch target buffer (BTB). We use the 
PPG for implementing a fast pre-program counter pre-PC, that ... 

Keywords: LRU mechanism, SPEC92 benchmark, Tango, base line architecture, branch 
target buffer, hardware resources, hardware-based data prefetching technique, instruction 
level parallelism, instruction prefetching, memory reference instructions, parallel processing, 
performance, program progress graph, simulation results, slack time, superscalar processors 

Todd M. Austin, Dionisios N. Pnevmatikatos, Gurindar S. Sohi 

May 1995 ACM SIGARCH Computer Architecture News , Proceedings of the 22nd 

annual international symposium on Computer architecture volume 23 issue 2 


For many programs, especially integer codes, untolerated load instruction latencies account 
for a significant portion of total execution time. In this paper, we present the design and 
evaluation of a fast address generation mechanism capable of eliminating the delays caused 
by effective address calculation for many loads and stores. Our approach works by predicting 
early in the pipeline (part of) the effective address of a memory access and using this 
predicted address to speculatively access the ... 

8 Parallel execution of prolog programs: a survey 

Gopal Gupta, Enrico Pontelli, Khayri A.M. Ali, Mats Carlsson, Manuel V. Hermenegildo 
July 2001 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 23 Issue 4 

Since the early days of logic programming, researchers in the field realized the potential for 
exploitation of parallelism present in the execution of logic programs. Their high-level 
nature, the presence of nondeterminism, and their referential transparency, among other 
characteristics, make logic programs interesting candidates for obtaining speedups through 
parallel execution. At the same time, the fact that the typical applications of logic 
programming frequently involve irregular computatio ... 

Keywords: Automatic parallelization, constraint programming, logic programming, 
parallelism, prolog 



9 Experimental evaluation of on-chip microprocessor cache memories 
Mark D. Hill, Alan Jay Smith 

January 1984 ACM SIGARCH Computer Architecture News , Proceedings of the 11th 

annual international symposium on Computer architecture, volume 12 issue 3 


Advances in integrated circuit density are permitting the implementation on a single chip of 
functions and performance enhancements beyond those of a basic processor. One 
performance enhancement of proven value is a cache memory; placing a cache on the 
processor chip can reduce both mean memory access time and bus traffic. In this paper we 
use trace driven simulation to study design tradeoffs for small (on-chip) caches. Miss ratio 
and traffic ratio (bus traffic) are the metrics for cache p ... 

10 System 

Luca Benini, Giovanni de Micheli 

April 2000 ACM Transactions on Design Automation of Electronic Systems (TODAES), 

Volume 5 Issue 2 


This tutorial surveys design methods for energy-efficient system-level design. We consider 
electronic systems consisting of a hardware platform and software layers. We consider the 
three major constituents of hardware that consume energy, namely computation, 
communication, and storage units, and we review methods of reducing their energy 
consumption. We also study models for analyzing the energy cost of software, and methods 
for energy-efficient software design and compilation. This survey ... 

11 Continuous program optimization: A case study 
Thomas Kistler, Michael Franz 

July 2003 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 25 Issue 4 


Much of the software in everyday operation is not making optimal use of the hardware on 
which it actually runs. Among the reasons for this discrepancy are hardware/software 
mismatches, modularization overheads introduced by software engineering considerations, 
and the inability of systems to adapt to users' behaviors. A solution to these problems is to 
delay code generation until load time. This is the earliest point at which a piece of software 
can be fine-tuned to the actual capabilities of the ... 

Keywords: Dynamic code generation, continuous program optimization, dynamic 
reoptimization 

Ann Gordon-Ross, Susan Cotterell, Frank Vahid 
November 2003 ACM Transactions on Embedded Computing Systems (TECS), Volume 2 Issue 4 


Instruction caches have traditionally been used to improve software performance. Recently, 
several tiny instruction cache designs, including filter caches and dynamic loop caches, have 
been proposed to instead reduce software power. We propose several new tiny instruction 
cache designs, including preloaded loop caches, and one-level and two-level hybrid 
dynamic/preloaded loop caches. We evaluate the existing and proposed designs on 
embedded system software benchmarks from both the Powerstone and ... 

Keywords: Loop cache, architecture tuning, embedded systems, filter cache, fixed 
program, instruction cache, low energy, low power 



13 How Java programs interact with virtual machines at the microarchitectural level 
Lieven Eeckhout, Andy Georges, Koen De Bosschere 

October 2003 ACM SIGPLAN Notices, Proceedings of the 18th ACM SIGPLAN 
conference on Object-oriented programming, systems, languages, and 
applications, Volume 38 Issue 11 



Java workloads are becoming increasingly prominent on various platforms ranging from 
embedded systems, over general-purpose computers to high-end servers. Understanding 
the implications of all the aspects involved when running Java workloads, is thus extremely 
important during the design of a system that will run such workloads. In other words, 
understanding the interaction between the Java application, its input and the virtual 
machine it runs on, is key to a successful design. The goal of this ... 

Keywords: Java workloads, performance analysis, statistical data analysis, virtual machine 
technology, workload characterization 



14 Compiler transformations for high-performance computing 
David F. Bacon, Susan L. Graham, Oliver J. Sharp 
December 1994 ACM Computing Surveys (CSUR), volume 26 issue 4 


In the last three decades a large number of compiler transformations for optimizing 
programs have been implemented. Most optimizations for uniprocessors reduce the number 
of instructions executed by the program using transformations based on the analysis of 
scalar quantities and data-flow techniques. In contrast, optimizations for high-performance 
superscalar, vector, and parallel processors maximize parallelism and memory locality with 
transformations that rely on tracking the properties o ... 

Keywords: compilation, dependence analysis, locality, multiprocessors, optimization, 
parallelism, superscalar processors, vectorization 



15 Fast detection of communication patterns in distributed executions 
Thomas Kunz, Michiel F. H. Seuren 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced Studies 
on Collaborative research 


Understanding distributed applications is a tedious and difficult task. Visualizations based on 
process-time diagrams are often used to obtain a better understanding of the execution of 
the application. The visualization tool we use is Poet, an event tracer developed at the 
University of Waterloo. However, these diagrams are often very complex and do not provide 
the user with the desired overview of the application. In our experience, such tools display 

16 External memory algorithms and data structures: dealing with 




Jeffrey Scott Vitter 

June 2001 ACM Computing Surveys (CSUR), Volume 33 issue 2 


Data sets in large applications are often too massive to fit completely inside the computer's 
internal memory. The resulting input/output communication (or I/O) between fast internal 
memory and slower external memory (such as disks) can be a major performance 
bottleneck. In this article we survey the state of the art in the design and analysis of 
external memory (or EM) algorithms and data structures, where the goal is to exploit locality 
in order to reduce the I/O costs. We consider a varie ... 

Keywords: B-tree, I/O, batched, block, disk, dynamic, extendible hashing, external 
memory, hierarchical memory, multidimensional access methods, multilevel memory, 
online, out-of-core, secondary storage, sorting 



17 The performance ... 

Nihar R. Mahapatra, Jiangjiang Liu, Krishnan Sundaresan 

June 2002 ACM SIGPLAN Notices , Proceedings of the workshop on Memory system 

performance, Volume 38 Issue 2 supplement 


The memory system stores information comprising primarily instructions and data and 
secondarily address information, such as cache tag fields. It interacts with the processor by 
supporting related traffic (again comprising addresses, instructions, and data). Continuing 
exponential growth in processor performance, combined with technology, architecture, and 
application trends, place enormous demands on the memory system to permit this 
information storage and exchange at a high-enough performance ... 

Keywords: Markov models, address compression, bandwidth, cache, data compression, 
entropy, instruction compression, latency, lossless compression, memory, register file, 
storage, traffic 



18 Microarchite^ 

Wessam Hassanein, Jose Fortes, Rudolf Eigenmann 

June 2004 Proceedings of the 18th annual international conference on 
Supercomputing 


In modern architectures, memory access latency is an increasingly performance-limiting 
factor. To reduce this latency, we propose concepts and implementation of a new technique 
that uses an in-memory processor to precompute future, critical load addresses and forward 
the computed values to the main processor. The acronym for this technique is IMPT for In- 
Memory Precomputation-based forwarding Threads. IMPT combines the advantages of 
precomputation-based techniques with the low memory access late ... 

Keywords: forwarding, precomputation, prefetching, processing-in-memory 

October 2002 Proceedings of the 15th international symposium on System Synthesis 







We propose a novel customization methodology for power reduction on the communication 
link between an embedded processor and its data memory. We target the address bus and 
show how by utilizing application information about the memory references in the data 
intensive program loops, a power efficient address communication protocol can be 
established between the processor core and the data memory. The data memory controller 
thus generates the addresses for the various data streams with minimal run ... 

20 Dynamic speculation and synchronization of data dependences 
Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, Gurindar S. Sohi 
May 1997 ACM SIGARCH Computer Architecture News , Proceedings of the 24th 

annual international symposium on Computer architecture, volume 25 issue 2 

Data dependence speculation is used in instruction-level parallel (ILP) processors to allow 
early execution of an instruction before a logically preceding instruction on which it may be 
data dependent. If the instruction is independent, data dependence speculation succeeds; if 
not, it fails, and the two instructions must be synchronized. The modern dynamically 
scheduled processors that use data dependence speculation do so blindly (i.e., every load 
instruction with unresolved dependences is spec ... 